Popular Historic Figures Exploration
by Julian Hernández

Database Information

Pantheon is a project developed by the Macro Connections group at The MIT Media Lab. It is a way to celebrate our accomplishments as a species by documenting our global heritage. You can find more about this dataset and project on Kaggle or Pantheon’s Official Site.

This dataset gathers information on the 11,341 biographies that are present in more than 25 language editions of Wikipedia (as of May 2013). It is not restricted to any cultural domain or time period.

The dataset has 17 variables, from which I chose 12 to analyze. The structure below also lists antiquity and hpi, two helper columns used later in the analysis.
## 'data.frame':    11337 obs. of  14 variables:
##  $ full_name                  : chr  "Aristotle" "Plato" "Jesus Christ" "Socrates" ...
##  $ birth_year                 : num  -384 -427 -4 -469 -356 ...
##  $ sex                        : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ country                    : Factor w/ 196 levels "","Afghanistan",..: 63 63 80 63 63 81 37 81 180 63 ...
##  $ continent                  : Factor w/ 8 levels "","Africa","Asia",..: 4 4 3 4 4 4 3 4 4 4 ...
##  $ occupation                 : Factor w/ 88 levels "Actor","American Football Player",..: 60 60 75 60 54 45 60 67 88 60 ...
##  $ industry                   : Factor w/ 27 levels "Activism","Business",..: 24 24 25 24 20 14 24 11 15 24 ...
##  $ domain                     : Factor w/ 8 levels "Arts","Business & Law",..: 4 4 5 4 5 7 4 5 4 4 ...
##  $ article_languages          : int  152 142 214 137 138 174 192 128 141 114 ...
##  $ page_views                 : int  56355172 46812003 60299092 40307143 48358148 88931135 22363652 43088745 20839405 26168219 ...
##  $ average_views              : int  370758 329662 281771 294213 350421 511098 116477 336631 147797 229546 ...
##  $ historical_popularity_index: num  32 32 31.9 31.7 31.6 ...
##  $ antiquity                  : num  2397 2440 2017 2482 2369 ...
##  $ hpi                        : num  32 32 31.9 31.7 31.6 ...

Variable Description

full_name: Name of the historical figure
birth_year: Birth Year of the historical figure in the Gregorian Calendar
sex: Biological Sex of the historical figure (Male / Female)
country: Country or modern day equivalent where the historical figure was born.
continent: Continent where the historical figure was born.
occupation: Profession the historical figure had.
industry: Industry of the historical figure.
domain: Domain of knowledge where the historical figure excelled.
article_languages: Number of Wikipedia language editions in which the historical figure’s biography is present.
page_views: Total number of page views in all languages.
average_views: Total number of page views divided by the number of languages the biography is in.
HPI or historical_popularity_index: A more complex way to measure historical impact, using the number of languages, page views, age of the character, and the variation in page views per language.

##   full_name           birth_year        sex                 country    
##  Length:11337       Min.   :-3500   Female:1495   United States :2169  
##  Class :character   1st Qu.: 1791   Male  :9842   United Kingdom:1147  
##  Mode  :character   Median : 1919                 France        : 866  
##                     Mean   : 1658                 Italy         : 809  
##                     3rd Qu.: 1961                 Germany       : 748  
##                     Max.   : 2005                 Unknown       : 433  
##                                                   (Other)       :5165  
##          continent               occupation               industry   
##  Europe       :6366   Politician      :2528   Government      :2703  
##  North America:2439   Actor           :1193   Film And Theatre:1374  
##  Asia         :1188   Soccer Player   :1064   Team Sports     :1230  
##  Africa       : 419   Writer          : 953   Music           :1054  
##  Unknown      : 406   Religious Figure: 517   Language        : 998  
##  South America: 366   Singer          : 437   Natural Sciences: 736  
##  (Other)      : 153   (Other)         :4645   (Other)         :3242  
##                   domain     article_languages   page_views       
##  Institutions        :3453   Min.   : 26.00    Min.   :     1965  
##  Arts                :2866   1st Qu.: 29.00    1st Qu.:   628928  
##  Sports              :1756   Median : 35.00    Median :  1603951  
##  Science & Technology:1366   Mean   : 40.77    Mean   :  4202224  
##  Humanities          :1328   3rd Qu.: 46.00    3rd Qu.:  4485693  
##  Public Figure       : 358   Max.   :214.00    Max.   :145250649  
##  (Other)             : 210                                        
##  average_views     historical_popularity_index   antiquity     
##  Min.   :     49   Min.   : 9.879              Min.   :   8.0  
##  1st Qu.:  18442   1st Qu.:20.432              1st Qu.:  52.0  
##  Median :  43871   Median :23.027              Median :  94.0  
##  Mean   :  89439   Mean   :22.308              Mean   : 354.7  
##  3rd Qu.: 107243   3rd Qu.:24.589              3rd Qu.: 222.0  
##  Max.   :1515232   Max.   :31.994              Max.   :5513.0  
##                                                                
##       hpi        
##  Min.   : 9.879  
##  1st Qu.:20.432  
##  Median :23.027  
##  Mean   :22.308  
##  3rd Qu.:24.589  
##  Max.   :31.994  
## 

Preliminary Questions

Is there any guiding principle for historical popularity, such as location, profession or time period? Which professions are more historically memorable or significant? Is there any difference between men and women in the data? How does location influence historical popularity? Is there any difference in the occupations that are remembered per location?

Univariate Plots Section

Year of Birth

Year of Birth: Most globally notable figures were born in recent times. Nevertheless there is huge variability: dates range from 3500 BC to 2005 AD.

The data follows an upward trend towards recent times, but it dips in the most recent years (1995 - 2010). This could be because people born in those years are still young and need more time to develop their careers and skills. It could also be due to the specialization of skills needed in contemporary work, which makes teams more necessary, and teams do not appear in the dataset.

Sex:

There is a huge difference in the number of men and women in the dataset. In this exploration it won’t be possible to explain this imbalance, because we are not looking at the complete set of biographies in Wikipedia but at a subset: the ones translated into more than 25 languages.

Location

In the previous plots we can see the distribution of great people among continents and countries.

In terms of continents, most notable people come from Europe, followed by North America and then Asia. In terms of countries, the United States is followed by several European countries (the UK, France, Italy and Germany).

This reflects Wikipedia’s bias, which the authors of the dataset identify as an issue. Wikipedia is more popular in Western countries, where European history and figures are studied and remembered. Continents with huge populations and a history that spans millennia, such as Asia or Africa, are underrepresented.

Industry & Domain

Here we have the profession of the notable people in the dataset at decreasing levels of granularity: occupation, industry and domain.

The most common occupation in the dataset is politician, followed by actor, soccer player and writer. This carries over to the distribution of great people by industry and domain, where most of them are located in government, the arts and sports.

A few things surprised me. One is the number of religious figures that are well known and remembered. I also thought that a career in the entertainment industry would be the leading occupation and that entertainment would be the leading domain, since Wikipedia is volunteer-run and celebrities tend to be far more popular than any politician. Politicians do have a historical advantage, though: politics as an occupation has existed for a long time.

Number of Languages

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.00   29.00   35.00   40.77   46.00  214.00

The number of languages is skewed to the right (the mean, 40.77, is well above the median, 35): most biographies sit at the low end, with 75% of the dataset being translated into 46 languages or fewer. There are several outliers. The biggest value in the dataset, 214 languages, belongs to the article on Jesus Christ, followed by Barack Obama (200 languages) and US actor Corbin Bleu (193).

The count of biographies falls off roughly exponentially as the number of languages increases.

Total Views & Average Views

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      1965    628928   1603951   4202224   4485693 145250649

Page views are highly skewed to the right. Using log10 it’s possible to get a better view of the data: the distribution peaks around 1,000,000 views and starts declining again at 10,000,000. The quartiles tell the same story: 50% of the values are below about 1,600,000 and 75% are below about 4,500,000.
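As a rough check on what the log10 transform does, the quartiles from the summary output can be compared on both scales. This is only a minimal sketch in base R; the vector name `pv` is illustrative and the numbers are copied from the summary rather than recomputed from the data.

```r
# Five-number summary of page_views, copied from the summary output above
pv <- c(min = 1965, q1 = 628928, median = 1603951, q3 = 4485693, max = 145250649)

# Distance from the median to each quartile on the raw scale
raw_gap <- c(lower = pv[["median"]] - pv[["q1"]],
             upper = pv[["q3"]] - pv[["median"]])

# The same distances after a log10 transform
log_pv  <- log10(pv)
log_gap <- c(lower = log_pv[["median"]] - log_pv[["q1"]],
             upper = log_pv[["q3"]] - log_pv[["median"]])

# On the raw scale the upper gap is about three times the lower one (right skew);
# on the log10 scale the two gaps are of similar size
raw_ratio <- raw_gap[["upper"]] / raw_gap[["lower"]]
log_ratio <- log_gap[["upper"]] / log_gap[["lower"]]
print(round(c(raw = raw_ratio, log10 = log_ratio), 2))
```

A ratio near 1 on the log10 scale is what one would expect from a distribution that looks roughly symmetric once transformed.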

Nevertheless, page views are filled with outliers, which are people who have changed the course of history or shaken up their fields with fresh ideas. The people with the most views on Wikipedia in this dataset are Michael Jackson, Adolf Hitler and Justin Bieber. An unlikely trio.
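To give a sense of how extreme those values are, the sketch below applies the usual boxplot rule (points beyond 1.5 times the interquartile range) to the quartiles from the summary output. Using that rule is an assumption here; the text does not state how outliers were defined.

```r
# Quartiles and maximum of page_views, copied from the summary output above
q1 <- 628928
q3 <- 4485693
max_views <- 145250649

# Tukey's boxplot rule (assumed): values above Q3 + 1.5 * IQR count as outliers
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

print(upper_fence)              # roughly 10.3 million views
print(max_views / upper_fence)  # the most viewed biography is about 14 times the fence
```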

Average Views

Average views is obtained by dividing the total number of page views by the number of languages in which the article is available.
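The definition can be checked against the first four rows shown in the structure output at the top of the report. This is a small sketch; the data frame name `toy` and the rounding to the nearest integer are assumptions made for illustration.

```r
# First four rows of the dataset, copied from the str() output at the top
toy <- data.frame(
  full_name         = c("Aristotle", "Plato", "Jesus Christ", "Socrates"),
  article_languages = c(152, 142, 214, 137),
  page_views        = c(56355172, 46812003, 60299092, 40307143),
  average_views     = c(370758, 329662, 281771, 294213)
)

# Recompute the average and compare it with the column shipped in the dataset
toy$recomputed <- round(toy$page_views / toy$article_languages)
print(toy[, c("full_name", "average_views", "recomputed")])
```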

Average page views behaves similarly to total page views: it is skewed to the right, and applying log10 brings its distribution closer to a normal shape. Nevertheless it doesn’t measure the same thing as total page views. What average page views tells us is who is consistently more popular across several languages, but it is a mean, so it is still susceptible to being inflated by a few large values. Perhaps a median across languages would have been more useful to describe the distribution of the page views.

The biographies with the highest average page views are Kim Kardashian, Lil Wayne and Eminem. This could mean either that they have a large international audience or that a few specific countries drive their views up, so that they have a high page view count relative to the number of languages their biography is translated into.

Historical Popularity Index (HPI)

The HPI is calculated taking into account several key indicators, such as the time passed since the historical figure lived, the number of times their biography has been visited, the number of views in different languages and other key factors.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.879  20.432  23.027  22.308  24.589  31.994

We can see its distribution resembles a normal one, but it is a little skewed to the left: the mean, 22.3, sits below the median, 23.0, and the tail of low values is longer. Most values are in the 20 - 24 range.

========================================================

Univariate Analysis

What is the structure of your dataset?

There are 11337 notable people in the dataset with 12 features (full_name, birth_year, sex, country, continent, occupation, industry, domain, article_languages, page_views, average_views, historical_popularity_index). Several of them are factors (sex, country, continent, occupation, industry and domain), but none of them are ordered.

Other notable details of the variables are:
- A huge disparity between the number of men and women.
- Most notable people come from Europe and worked in government institutions.
- The most common occupation is politician.
- The country with the most great people is the United States.
- Half of the people in the dataset have between 628,928 and 4,485,693 page views (the first and third quartiles).

What is/are the main feature(s) of interest in your dataset?

The features in the dataset I’d like to explore are sex and historical_popularity_index. I’d like to determine the relationship between these variables and the rest of the dataset. I’d also like to know if there are variables that predict “historical relevance” and how good they are as predictors.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think domain, birth year, continent and the number of languages the biography is translated into might give an insight into the kind of historical figure that becomes popular on Wikipedia.

Did you create any new variables from existing variables in the dataset?

Not in this section; here I only used existing variables. Later, for the HPI exploration, I derive antiquity from birth_year.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Most of the numeric variables are skewed to either the right or the left, and most of them have lots of extreme values. For some of the variables, like page_views and average_views, I used log10 to transform the data so that its distribution could be seen and handled more easily.

I also converted birth_year from a factor variable to a numeric variable. It was what made the most sense in order to visualize and analyze it.

I also dropped variables such as longitude and latitude. The information they give is very specific and adds little beyond what continent or country already tell me. I also dropped the state variable due to its specificity: it only applied to the US.
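For readers repeating these steps, here is a hedged sketch of the two operations in base R. The values are a toy stand-in, the “Unknown” entry is only there to show what happens to non-numeric values, and the column names `latitude`, `longitude` and `state` are assumed rather than taken from the report.

```r
# Toy factor standing in for the raw birth_year column (values are illustrative)
birth_year_raw <- factor(c("-384", "-427", "-4", "Unknown"))

# Pitfall: as.numeric() on a factor returns the internal level codes, not the years
level_codes <- as.numeric(birth_year_raw)

# Going through as.character() recovers the years; non-numeric entries become NA
birth_year <- suppressWarnings(as.numeric(as.character(birth_year_raw)))
print(birth_year)

# Dropping columns by name; the names and placeholder values below are assumptions
db <- data.frame(full_name = "Aristotle", latitude = 0, longitude = 0, state = NA)
db <- db[, !(names(db) %in% c("latitude", "longitude", "state")), drop = FALSE]
print(names(db))
```

Converting through `as.character()` first is the design choice that matters: it is what keeps negative (BC) years intact instead of replacing them with level indices.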

========================================================

Bivariate Plots Section

There are two variables in the data I want to explore, sex and HPI. During this exploration I will search for relationships between the other variables and these two.

Sex:

Women tend to have more views than men, both in total and on average, but men have more outliers on the upper part of the boxplot. The only boxplot with a different behavior is the one for historical popularity: here men tend to have higher values than women, and although men again have more outliers, in this case most of them are on the lower part of the boxplot.

Both men and women have similar distributions of languages in their biographies.

Birth year

We see the same explosion after 1750 in both sexes, but women lag behind for 250 years before seeing truly significant growth. More women are known from recent times (1750 and later) than from the rest of human history. Nevertheless men have far more records than women, as we saw in the previous analysis.

We also see, in both sexes, the same behavior we observed before: both rise until recent times, where they have a small dip.

Location

Female historical figures were born mostly in the United States and follow a distribution similar to the general one we explored before. In other words, most of them were born in European countries or the US. One relevant difference between the two sexes is that most of the figures with Unknown nationality are men, not women.

On the continent scale, Europe is barely holding on to first place in terms of notable women; North America is quickly catching up.

The rest of the world (Africa, South America and Oceania) has almost no notable women.

Profession

At first glance we can see that, among the less populated occupations in the dataset, there is a clear divide between the sexes. Gymnast, pornographic actor, model and companion tend to be roles filled by women, with few notable men. Meanwhile men dominate roles such as historian, inventor, explorer and composer: popular occupations with few notable women.

Most notable women work in the Arts, specifically as actresses, which we can see reflected in the distributions of the industry and domain fields.

Here we see a big departure from the distribution we saw in the domain univariate analysis. The distribution of domains for women is remarkably different from the one for men and from the general one.

Bivariate Analysis: Gender Exploration

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

First of all, there were some differences in occupations between men and women. Most notable men worked in Institutions while women worked in the Arts, specifically as actresses. The most popular profession for men was politician.

In terms of location, most notable women came from the United States, with several European countries following behind. Looking at the data from a continent perspective, Europe is the place where most notable women were born, followed by North America and Asia, with very few figures coming from the rest of the world. The location distribution for men is similar to the general one.

A big difference also shows up in popularity: women tend to have more page views than men, both in total and on average, but men tend to have a higher HPI than women. Both have similar distributions of languages.

Through this exploration we could see the most common country for notable women to come from and the field in which a woman would most commonly work. Nevertheless that is only the beginning. I would like to know if there are relationships between views, country and occupation. For example:
- Is a Male French artist more popular than an American one? How about a Female one?
- Which is the most common occupation for notable women per country? How about by continent? And for men?
- What kind of occupation receives the most page views on Wikipedia?

========================================================

HPI

Page Views & HPI

## 
##  Pearson's product-moment correlation
## 
## data:  db$hpi and db$page_views
## t = 10.046, df = 11335, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0756611 0.1121524
## sample estimates:
##       cor 
## 0.0939383

An initial analysis of the relationship between page views and HPI seems to point to a lack of relationship between the two. The scatterplot and the Pearson correlation coefficient (about 0.09) both point in that direction.

In order to explore the relationship between page views and HPI a bit more, I decided to use log10 to normalize the distribution of page views.

## 
##  Pearson's product-moment correlation
## 
## data:  db$hpi and log10(db$page_views)
## t = 5.9375, df = 11335, p-value = 2.978e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03731294 0.07401490
## sample estimates:
##        cor 
## 0.05568273

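After applying log10 to the variable we see that there is still no clear relationship between HPI and page views.

I double-checked using the Pearson correlation coefficient, which is lower than it was before (about 0.06), almost 0, meaning “no relation”. The p-value is tiny only because the sample is so large; the size of the correlation itself is negligible.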
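It may help to see why a correlation this close to zero still comes with a very small p-value. The sketch below recomputes the t statistic from the reported coefficient using the standard relation t = r * sqrt(df) / sqrt(1 - r^2). The helper name `t_from_r` is made up for this example, and the inputs are copied from the two test outputs above.

```r
# t statistic of a Pearson correlation test: t = r * sqrt(df) / sqrt(1 - r^2)
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

df <- 11335  # degrees of freedom reported above (n - 2)

# Correlations of HPI with page_views and log10(page_views), copied from the output
r_raw <- 0.0939383
r_log <- 0.05568273

print(round(c(raw = t_from_r(r_raw, df), log10 = t_from_r(r_log, df)), 3))

# Share of variance in HPI that a straight line on each variable would account for
print(round(c(raw = r_raw^2, log10 = r_log^2), 4))
```

Because t grows with the square root of the degrees of freedom, more than eleven thousand rows turn even a faint correlation into a significant one, while r squared stays below 1%.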

Average Views

## 
##  Pearson's product-moment correlation
## 
## data:  db$hpi and db$average_views
## t = -9.5395, df = 11335, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.10747533 -0.07095238
## sample estimates:
##         cor 
## -0.08924385

I ran the same process on average views as on page_views. On the scatterplot we can see a lack of relationship between the variables, a hypothesis that is confirmed by the Pearson correlation coefficient, -0.09, which indicates a lack of relationship.

I ran the test again with log10 to see if normalizing the distribution of the variable helps.

## 
##  Pearson's product-moment correlation
## 
## data:  db$hpi and log10(db$average_views)
## t = -7.1522, df = 11335, p-value = 9.069e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08532955 -0.04867882
## sample estimates:
##         cor 
## -0.06702679

Just as with page_views, using log10 brings the Pearson coefficient even closer to zero (from -0.09 to -0.07).

Both average and total page views seem to have little relevance in the importance of an historical figure.

Birth Year

In order to explore the correlation between HPI and birth year, I transformed the birth year variable to avoid negative numbers using the following formula: antiquity = 2013 - birth_year.

We subtract from 2013 since the dataset was last updated in May 2013.
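As a quick illustration of the formula, the sketch below applies it to the first five birth years listed in the structure output at the top of the report. The variable name `reference_year` is only illustrative.

```r
# Birth years of the first five figures, copied from the str() output at the top
birth_year <- c(-384, -427, -4, -469, -356)

# Reference year taken from the text: the dataset was last updated in 2013
reference_year <- 2013

# Negative (BC) birth years turn into large positive antiquity values
antiquity <- reference_year - birth_year
print(antiquity)

# Every value is positive, so log10() can be applied without producing NaN
print(round(log10(antiquity), 2))
```

The results match the antiquity column shown in the structure output (2397, 2440, 2017, 2482, 2369).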

Using the scatterplot we can see that recent great people tend to have the most variance in HPI and older historical figures tend to have higher HPI. On the other hand, recent notable people are the only ones in the dataset with extremely low HPI values.

## 
##  Pearson's product-moment correlation
## 
## data:  db$hpi and log10(db$antiquity)
## t = 109.33, df = 11335, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7073565 0.7252787
## sample estimates:
##       cor 
## 0.7164358

Using the Pearson correlation coefficient (about 0.72 against log10 of antiquity) we can see that antiquity and HPI are strongly related. It could be one of the variables used to predict HPI.

Number of Languages

There seems to be a vague positive relationship between HPI and number of languages, but the scatterplot is too dispersed to tell. I used Pearson to check the relationship between the two.

## 
##  Pearson's product-moment correlation
## 
## data:  db$hpi and db$article_languages
## t = 52.566, df = 11335, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4277944 0.4573965
## sample estimates:
##       cor 
## 0.4427161

Languages and HPI have a high correlation coefficient (0.44) in comparison to page views or average views. Nevertheless 0.44 is still too low to be considered a strong relationship between the two; it is moderate at best.
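One way to put the coefficients reported in this section side by side is to square them, which gives the share of HPI variance a simple linear fit on each variable would capture. This is a rough comparison sketch, not a model; the vector names are illustrative and the values are copied from the test outputs above.

```r
# Pearson correlations with HPI, copied from the test outputs in this section
r <- c(
  "page_views"           =  0.0939383,
  "log10(page_views)"    =  0.05568273,
  "average_views"        = -0.08924385,
  "log10(average_views)" = -0.06702679,
  "log10(antiquity)"     =  0.7164358,
  "article_languages"    =  0.4427161
)

# r^2: share of HPI variance that a simple linear fit on each variable would capture
r_squared <- sort(r^2, decreasing = TRUE)
print(round(r_squared, 3))
```

Read this way, log10 of antiquity accounts for roughly half of the variance, number of languages for about a fifth, and the view counts for less than 1% each.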

Profession & Domain

In professions there are winners and there are losers. In the previous plot we can see that there are occupations that never rise above the median HPI; the worst case is that of swimmers, who fall behind every other profession. On the other side, the profession with the highest median HPI is philosopher.

Other low performing occupations include: gymnast, skater, skier or tennis player.

Other high performing occupations are: pirate, public worker or explorer.

In the previous plots we can see the picture painted in the occupation boxplots come to life.

Industries related to sports do poorly on the HPI. Meanwhile, philosophy, history and fine arts are the ones with the highest HPI.

I have to point out that even though Fine Arts is doing great other fields of the arts are not that lucky: Music, Film & Theater and Media Personality are some of the worst performing industries.

Finally it all comes together on domain, the Humanities (where philosophy is located) soar along with Institutions. While the Arts and Sports are the fields with the lowest HPI.

Location

Most countries fall under the median HPI line, with the lowest HPI belonging to Swaziland. The country with the highest HPI shouldn’t come as a surprise, since its historical contributions are well known. It is also known as the birthplace of Western civilization: Greece.

It is also important to point out the behavior of the countries with the most notable people: the US, the UK and France. The UK and the US have their respective medians below the global one and have lots of outliers on either side. Meanwhile, France behaves similarly to Germany and Italy: they have a median HPI higher than the global one but also a lot of outliers pulling them down.

Finally I have to point out that Unknown country has one of the highest HPIs.

In terms of continent we see the same behavior reinforced. Most continents have low HPI levels; only Europe, Asia and Unknown have a median higher than the global median.

I would also like to point out that North America has most of its values below the mean.

Sex

We analyzed the relationship between sex and HPI before. Men tend to have higher HPI values than women.

## db$sex: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.879  18.328  21.238  20.889  23.512  30.037 
## -------------------------------------------------------- 
## db$sex: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.25   20.90   23.20   22.52   24.70   31.99
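As a small consistency check, the two group means above can be combined with the group sizes from the dataset summary to recover the overall mean HPI. This is a sketch using only numbers already printed in this report; the variable names are illustrative.

```r
# Group sizes from the dataset summary and mean HPI per sex from the output above
n        <- c(Female = 1495, Male = 9842)
mean_hpi <- c(Female = 20.889, Male = 22.52)

# The overall mean is the size-weighted average of the group means
overall_mean <- weighted.mean(mean_hpi, w = n)
print(round(overall_mean, 2))  # close to the overall mean of 22.308 reported earlier

# Gap between the group medians reported above
median_gap <- 23.20 - 21.238
print(median_gap)
```

The gap of roughly two HPI points between the medians is the difference the rest of this section refers to.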

General

## [1] "antiquity"         "domain"            "continent"        
## [4] "article_languages" "sex"               "page_views"       
## [7] "average_views"     "hpi"

Bivariate Analysis: HPI

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The main takeaway of the exploration of HPI was the lack of relationship between it and most variables. The exceptions are antiquity, which has a strong correlation with HPI, and number of languages, which has a moderate one.

In terms of categorical variables (factors), in the next phase of the exploration I will only use the least granular format: domain for occupations and continent for countries.

The continent variable has a couple of surprises: Unknown has the highest median in the dataset, followed by Europe and Asia. Surprisingly, North America lags behind. Continents like South America, Africa and Oceania have the lowest medians overall.

In terms of profession we can see that most domains have a median similar to the global one, with the exception of Sports, which is way below, and Humanities and Institutions, which are above the global median.

In these categorical variables we saw that the HPI distribution is not equal between levels: some have higher or lower HPIs. This should prove helpful for our modelling ambitions.

Did you observe any interesting relationships between
the other features (not the main feature(s) of interest)?

Yes, average_views and page_views have a high correlation coefficient. Also, the number of languages a biography has been translated into is related to the number of page views it receives.

There is also a curious relationship between views, antiquity and HPI. Just as with HPI, recent notable figures tend to have more variance in views, but as a figure gets older they tend to receive fewer views. This is the opposite of what happens with HPI: as a figure gets older it becomes more historically relevant.

========================================================

Multivariate Plots Section

Sex:

I come to this part of the analysis with questions already in mind, like:
- Which profession domains receive more visits depending on gender?
- Which continent has more notable women?

Sex, Occupation and Popularity.

As we saw in our earlier analysis, women tend to have more views than men. Now we see where those views concentrate.

Women receive significantly more page views in the following industries: Outlaws, Dance and Design. These are fields where women are more popular than men in most cases. Women also receive more page views than their male counterparts in almost every other industry, but men tend to have more outliers above the 75% quantile. This means that women as a group tend to be more popular, but the notable person with the most page views in a specific industry is likely to be male.

We see a similar behavior in the domain variable. Women tend to have more page views than men in most domains, except in Sports and Institutions, but we see the same behavior with outliers that we saw before. Although men as a group tend to be less popular than women as a group, the most viewed notable figures tend to be men. The exceptions are the Arts and Exploration domains, where women tend to have more views and there are few or no male outliers.

Sex, Continent and Page Views

Here we see the same behavior that we saw before: women tend to have a higher median and 75% quantile, but men have more outliers, which means that the most viewed people on each continent are likely men. Only North America and Oceania behave differently: there, women have a higher median and 75% quantile and there are few or no male outliers.

Sex, Location, Views and Occupation

Let’s start by dissecting each domain by the number of page views it gets on each continent.

Arts:

The first thing that pops up is the lack of women artists whose birth location is unknown. Second, notice that Europe shows the same behavior observed before: women tend to receive more views, but men have far more outliers with high page view counts. Other than that, we see that in Asia, Africa and North America women tend to have slightly more views. Artists from Oceania and South America tend to have more page views if they are male.

In other words, the most popular male artists will most likely come from Europe, and the female artists with the most page views will most likely come from North America.

Business & Law:

Women sure got the short straw on this domain. Most continents don’t have notable business women and the ones that do, North America, Asia and Europe, receive way less page views than their male counterparts in those continents.

Exploration:

The behavior is very similar to Business & Law with the addition of African Explorers.

Humanities:

Perhaps the most eye-catching fact of this section is the high number of views that South American women tend to have. Other than that, we see a behavior similar to other fields of knowledge: women tend to have slightly more views, men have more outliers with higher values.

Institutions:

We see the same pattern here, women have more views, men have more outliers. Nevertheless I want to point out a couple of details, the great amount of views women in institutions get in South America and the huge disparity between men and women in North America.

Public Figures:

This one is pretty mixed up. Men are popular in some continents, Women in others. There are not that many outliers and no clear structure.

Science & Technology:

Most continents have few women scientists, but in those that do, women have slightly more page views in general. The only continent that has more views than expected is Africa, where women tend to have far more than their male counterparts. Nevertheless we do see the same behavior with outliers as before.

Sports:

This is the male-dominated domain. Men tend to get more page views and have lots of outliers with really high values. Women tend to have fewer page views than the global median.

Conclusion: The behavior we saw means that women tend to be more popular and receive more page views in most fields and continents, but men tend to account for the most viewed notable people in most fields and continents, with outliers that have thousands of views more than the most popular women.

I decided to explore antiquity in relation to sex to see how the different occupations and continents behaved. I decided not to explore number of languages because we know from past analysis that it is consistent among genders.

Antiquity, Sex, Location and Occupation

There are old professions in human history; the ones related to exploration, institutions and the humanities (philosophy, religion) are some of them. This is evident in the plot. The oldest group is the one of notable men of unknown origin who worked in institutions; their median antiquity is around 900 years.

There are few differences between sexes in this scale.

HPI, Sex, Location and Occupation

We know that men tend to have higher HPI than women, but this division allows us to see how deep that difference goes. Most occupations in most continents have men with a higher HPI than women.

The combinations in which women tend to have a higher historical popularity index include Asian women working in Science and African women working in Institutions.

Multivariate Analysis: Sex

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Most of the plots confirmed what we already knew about the data: women tend to have more page views, men tend to have a higher HPI value, and most of the records in the dataset are from the last 300 years.

Nevertheless, we still found interesting insights. Women tend to have more page views as a group, but men tend to have more outliers with really high values. This means that a few men receive the most visits and become the face of their field, while an “average” woman tends to receive more visits than an “average” man.

In terms of antiquity, most domains tend to have low antiquity values, with the exception of Exploration, Humanities and Institutions, the oldest of professions. Men and women in these domains tend to have high antiquity values in continents like Europe or Asia, while the other continents have values concentrated in recent times. Even though both men and women from ancient times are remembered, men have more outliers, which means that the further we go into the past, the more the record consists only of men and the less women appear in it.

Finally, the interaction of sex with HPI. We know that men have a higher median HPI, and that is reflected in most combinations of continent, domain and sex. There is definitely a gap between men and women, especially in the Business & Law and Exploration domains. The domain least likely to be remembered is Sports: both men and women have low HPI values that are well below the median. Women, specifically, have little chance of a high HPI if they work in the Arts domain, whereas men in the Arts have a better chance of reaching high HPI values.

In conclusion, it is no secret that women have had the short end of the stick throughout history, and through this analysis of notable people I wanted to dig deeper into that. Although recent years have produced more notable women than all of prior history, most of those women are actresses; there are few women politicians, generals, game designers, historians, you name it. I would also like to point out that even though men have a higher HPI than women, this is mostly driven by antiquity, as we saw in our bivariate analysis. The lack of women in history therefore becomes a self-fulfilling prophecy: older notable people have higher HPI, men have older records than women, and so women will consistently fall below the median value. Overall, this was a fun dataset to analyze in terms of gender, which leads us to the next question: is HPI something we can predict using the data we have, and would our predictions be trustworthy?

========================================================

Multivariate Plots: HPI

HPI, Languages, Continents and Domain

The number of languages and HPI have a weak positive correlation. Nevertheless, there are still things to find: by using color we can see a clear separation between domains. Sports sits at the bottom with lower HPI values, while domains like the Humanities or Institutions tend to stay on top.
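The two observations in this paragraph, an overall correlation and a separation by domain, can be reproduced with a few lines of base R. This is only a sketch: the toy rows are invented for illustration and are not Pantheon records, and the column names `article_languages`, `hpi` and `domain` are assumed to match the data frame used in the modelling section.

```r
# Toy data invented purely for illustration; NOT taken from Pantheon.
toy <- data.frame(
  domain = c("Sports", "Sports", "Sports",
             "Humanities", "Humanities", "Humanities",
             "Institutions", "Institutions"),
  article_languages = c(30, 45, 60, 35, 50, 70, 40, 55),
  hpi = c(15, 17, 18, 25, 27, 28, 23, 26)
)

# Pearson correlation between number of languages and HPI.
r <- cor(toy$article_languages, toy$hpi)

# Median HPI per domain, sorted from lowest to highest.
median_by_domain <- sort(tapply(toy$hpi, toy$domain, median))
print(median_by_domain)

# Scatterplot colored by domain (only drawn in an interactive session).
if (interactive()) {
  dom <- factor(toy$domain)
  plot(toy$article_languages, toy$hpi, col = as.integer(dom), pch = 19,
       xlab = "article_languages", ylab = "hpi")
  legend("bottomright", legend = levels(dom),
         col = seq_along(levels(dom)), pch = 19)
}
```

Sorting the per-domain medians gives a quick numeric counterpart to the color bands visible in the scatterplot.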

When analyzing the distribution of continents in the scatterplot, no such relationship emerges. The colors do not separate into distinct areas.

I also explored the distribution of women in the scatterplot, but because there are so few of them in the dataset they get drowned out by men. Besides, there is no clear pattern to their behavior.

HPI, Antiquity, Continents and Domain

Most records in the dataset are from very recent times. Nevertheless, patterns still emerge.

In the domain visualization we see a big mesh of occupations in recent times with very different values. As we go further into the past, only a few professions remain, and they have high HPI.

It is also clear that most domains have a lot of variance, especially in recent times, but one domain that has consistently low HPI values is Sports.

On the other hand, continent is not a good variable to include: it does not unveil any hidden pattern in the data, and the variance within each continent is too large. Finally, as we go further into the past there is no clear “winner” as there was with domain; many continents have old notable figures.

HPI, Page Views, Continents and Domain

In terms of page views we see a reflection of our past analysis. Domain shows the same behavior we saw before, with people in the Humanities having high HPI values and Sports having low HPI. We also see that having more views does not mean you will have a high HPI in any domain.

In terms of continents, Europe dominates. But just as before there is a lot of variance within each factor level. For example, European figures tend to have HPI values between 20 and 25, but many of them have values well below that range. There is also no visible relationship between continent, page views and HPI.

We explored page views to see if there were relationships between the categorical variables and HPI that we had not seen, but it only gave us information we already had.

Multivariate Analysis: HPI

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

During this multivariate analysis of the HPI variable we confirmed the way it interacts with other variables, such as domain, average language and birth year. We saw that HPI tends to be higher the older a notable person is, and that some professions tend to be common in ancient times.

A pattern forms when you apply color to the language and HPI scatterplot: people working in the humanities and institutions have a higher HPI, and people in sports have a lower HPI.

Finally, I created visualizations using page_views to explore possible relationships that might have passed unnoticed before, but none appeared.

Although we found several relationships in the dataset, few variables are highly correlated with HPI, which could prove tough when building the model.

Were there any interesting or surprising interactions between features?

I created several test visualizations for sex, HPI and other variables such as average_language, antiquity and page_views. The imbalance between the number of men and women is obvious, but I was still surprised by how few women are in the dataset.

General

A way to look at the relationships between variables before creating a model.

## [1] "antiquity"         "domain"            "continent"        
## [4] "article_languages" "sex"               "page_views"       
## [7] "average_views"     "hpi"
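One simple way to inspect these relationships before modelling is a Pearson correlation matrix over the numeric columns. The base R sketch below illustrates the idea and is not necessarily how the original overview was produced. The toy rows are invented for illustration and are not Pantheon records; only the column names are borrowed from the list above.

```r
# Toy data invented purely for illustration; NOT taken from Pantheon.
toy <- data.frame(
  antiquity = c(2400, 500, 120, 60, 30, 90),
  article_languages = c(150, 60, 40, 30, 28, 35),
  page_views = c(56, 20, 8, 12, 30, 9) * 1e6,
  hpi = c(32, 27, 23, 20, 18, 22),
  domain = c("Humanities", "Institutions", "Arts",
             "Arts", "Sports", "Science & Technology")
)

# Keep only the numeric columns; factors and strings such as `domain`
# cannot enter a Pearson correlation directly.
numeric_cols <- vapply(toy, is.numeric, logical(1))

# Pairwise Pearson correlations, rounded for readability.
corr <- round(cor(toy[, numeric_cols], method = "pearson"), 2)
print(corr)
```

Categorical variables such as `domain`, `continent` and `sex` are left out here; they are better compared with grouped summaries or boxplots.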

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Through the analysis I found variables I wanted to include in the model:
- Birth year: it has a high correlation with HPI, and we know that older figures tend to have a higher HPI.
- Number of languages: it gives an idea of how much of a global presence the notable figure has, which is linked to HPI.
- Sex: men tend to have a higher HPI than women, a fact that could help the model predict outcomes.
- Domain: we have seen that some domains have higher HPI than others.
## 
## Calls:
## m1: lm(formula = I(hpi) ~ I(antiquity), data = db)
## m2: lm(formula = I(hpi) ~ I(antiquity) + domain, data = db)
## m3: lm(formula = I(hpi) ~ I(antiquity) + domain + article_languages, 
##     data = db)
## m4: lm(formula = I(hpi) ~ I(antiquity) + domain + article_languages + 
##     average_views, data = db)
## m5: lm(formula = I(hpi) ~ I(antiquity) + domain + article_languages + 
##     average_views + sex, data = db)
## 
## ================================================================================================================
##                                           m1             m2             m3             m4             m5        
## ----------------------------------------------------------------------------------------------------------------
##   (Intercept)                            21.580***      21.704***      18.946***      19.259***      18.171***  
##                                          (0.032)        (0.046)        (0.061)        (0.064)        (0.077)    
##   I(antiquity)                            0.002***       0.001***       0.001***       0.001***       0.001***  
##                                          (0.000)        (0.000)        (0.000)        (0.000)        (0.000)    
##   domain: Business & Law/Arts                            0.459          0.642**        0.494*         0.186     
##                                                         (0.238)        (0.208)        (0.206)        (0.202)    
##   domain: Exploration/Arts                               1.306***       0.823***       0.387          0.162     
##                                                         (0.245)        (0.214)        (0.214)        (0.209)    
##   domain: Humanities/Arts                                1.995***       1.627***       1.237***       1.040***  
##                                                         (0.082)        (0.072)        (0.075)        (0.074)    
##   domain: Institutions/Arts                              1.037***       0.924***       0.513***       0.286***  
##                                                         (0.066)        (0.057)        (0.062)        (0.061)    
##   domain: Public Figure/Arts                             0.306*         0.453***       0.362**        0.756***  
##                                                         (0.137)        (0.119)        (0.118)        (0.117)    
##   domain: Science & Technology/Arts                      1.379***       1.282***       0.820***       0.542***  
##                                                         (0.080)        (0.070)        (0.075)        (0.074)    
##   domain: Sports/Arts                                   -4.050***      -3.746***      -4.028***      -4.285***  
##                                                         (0.074)        (0.065)        (0.066)        (0.066)    
##   article_languages                                                     0.069***       0.075***       0.074***  
##                                                                        (0.001)        (0.001)        (0.001)    
##   average_views                                                                       -0.000***      -0.000***  
##                                                                                       (0.000)        (0.000)    
##   sex: Male/Female                                                                                    1.438***  
##                                                                                                      (0.061)    
## ----------------------------------------------------------------------------------------------------------------
##   R-squared                               0.178          0.476          0.600          0.609          0.627     
##   adj. R-squared                          0.178          0.476          0.600          0.609          0.627     
##   sigma                                   3.044          2.432          2.124          2.100          2.051     
##   F                                    2461.304       1286.306       1891.325       1766.524       1734.279     
##   p                                       0.000          0.000          0.000          0.000          0.000     
##   Log-likelihood                     -28705.452     -26155.965     -24619.061     -24491.561     -24221.710     
##   Deviance                           105027.711      66984.440      51076.764      49940.730      47618.989     
##   AIC                                 57416.903      52331.930      49260.122      49007.121      48469.420     
##   BIC                                 57438.911      52405.289      49340.816      49095.151      48564.786     
##   N                                   11337          11337          11337          11337          11337         
## ================================================================================================================
## 
## Call:
## lm(formula = I(hpi) ~ I(antiquity) + domain + article_languages + 
##     average_views + sex, data = db)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9341  -1.0692   0.2235   1.2465   8.2409 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1.817e+01  7.741e-02 234.753  < 2e-16 ***
## I(antiquity)                1.308e-03  3.073e-05  42.565  < 2e-16 ***
## domainBusiness & Law        1.859e-01  2.016e-01   0.922    0.357    
## domainExploration           1.622e-01  2.088e-01   0.777    0.437    
## domainHumanities            1.040e+00  7.363e-02  14.125  < 2e-16 ***
## domainInstitutions          2.863e-01  6.149e-02   4.657 3.24e-06 ***
## domainPublic Figure         7.559e-01  1.165e-01   6.486 9.15e-11 ***
## domainScience & Technology  5.417e-01  7.407e-02   7.313 2.79e-13 ***
## domainSports               -4.285e+00  6.561e-02 -65.311  < 2e-16 ***
## article_languages           7.418e-02  1.184e-03  62.636  < 2e-16 ***
## average_views              -2.651e-06  1.923e-07 -13.788  < 2e-16 ***
## sexMale                     1.438e+00  6.119e-02  23.498  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.051 on 11325 degrees of freedom
## Multiple R-squared:  0.6275, Adjusted R-squared:  0.6271 
## F-statistic:  1734 on 11 and 11325 DF,  p-value: < 2.2e-16

Overall, the model is only moderately trustworthy. Its R-squared is about 0.63, which means it explains roughly 63% of the variance in HPI and leaves more than a third unexplained. On the other hand, the p-values of most variables are far below 0.05, which means those variables have a statistically significant relationship with HPI.
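The R-squared and residual standard error of m5 can be recovered from the comparison table by plain arithmetic, which makes it easier to see what "variance explained" means here. The sketch below uses only numbers printed in the output above. It relies on the fact that for a linear model fitted with `lm` the reported deviance is the residual sum of squares; the variable names are chosen for illustration.

```r
# Values copied from the model output above.
rss_m1 <- 105027.711  # "Deviance" row, model m1
r2_m1  <- 0.178       # "R-squared" row, model m1 (rounded in the table)
rss_m5 <- 47618.989   # "Deviance" row, model m5
df_m5  <- 11325       # residual degrees of freedom of m5

# Total sum of squares of HPI, from R^2 = 1 - RSS / TSS applied to m1.
tss <- rss_m1 / (1 - r2_m1)

# Share of the variance explained by m5, and its complement.
r2_m5 <- 1 - rss_m5 / tss
unexplained_m5 <- rss_m5 / tss

# Residual standard error: sqrt(RSS / residual degrees of freedom).
sigma_m5 <- sqrt(rss_m5 / df_m5)

print(round(c(r2 = r2_m5, unexplained = unexplained_m5, sigma = sigma_m5), 3))
```

Because the table rounds the R-squared of m1 to three decimals, the recovered value for m5 agrees with the reported 0.6275 only approximately.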

The model could use some improvement, but with the variables at our disposal it is the most accurate one we could build.


Final Plots and Summary

Plot One

Description One

HPI is one of the guiding variables in our analysis. It is a little skewed to the right but does not have a long tail. Most notable people have HPI values between about 22 and 26, and the maximum value is 33. Even among notable people in history, some are more notable than others.

Plot Two

Description Two

This plot shows us some of the variables that, according to Pearson correlation, are related to HPI. We see that the more time has passed since a notable person's birth, the higher their HPI value tends to be. It also shows that the further you go into the past, the higher the probability that a person belongs to certain industries, and that some industries, like sports, have really low HPI values.

Overall this graph gives us many key insights about the dataset in a compact way.

Plot Three

Description Three

I spent a lot of time analyzing how women behaved in the dataset in comparison to men, so I would like to give an overview. The first thing we need to address is the huge difference in number between male and female notable people.

Nevertheless, these women know their stuff and tend to have more average and total views than men, by a lot, but men tend to have more outliers. This means that a few men tend to be the notable people with the most views, while men as a group tend to receive fewer views than women.

In terms of languages, both men and women have been translated a similar number of times. They have an almost identical distribution.

Men tend to have a higher HPI, meaning that more men are considered “important” in historical terms than women. It is important to point out that the 25th percentile for men is higher than the median for women, so at least 75% of male notable figures have a higher HPI than the lower 50% of women.
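A comparison of this kind, one group's first quartile against the other group's median, takes a couple of lines with `quantile` and `median`. The sketch below is a minimal illustration, assuming a data frame with `sex` and `hpi` columns. The toy values and the helper name `compare_q25_to_median` are invented for illustration and do not come from the Pantheon data.

```r
# Toy data invented purely for illustration; NOT taken from Pantheon.
toy <- data.frame(
  sex = c(rep("Male", 8), rep("Female", 5)),
  hpi = c(22, 23, 24, 25, 26, 27, 28, 29,
          18, 19, 20, 21, 26)
)

# Hypothetical helper: compare the 25th percentile of one group with the
# median of another, and report the share of the first group above it.
compare_q25_to_median <- function(df, value = "hpi", group = "sex",
                                  upper = "Male", lower = "Female") {
  x_upper <- df[[value]][df[[group]] == upper]
  x_lower <- df[[value]][df[[group]] == lower]
  lower_median <- median(x_lower)
  list(
    upper_q25 = quantile(x_upper, probs = 0.25, names = FALSE),
    lower_median = lower_median,
    share_upper_above = mean(x_upper > lower_median)
  )
}

res <- compare_q25_to_median(toy)
str(res)
```

If `upper_q25` exceeds `lower_median`, then `share_upper_above` is at least 0.75, which is the relationship the paragraph above describes.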

Finally, most notable people in the list were born quite recently. Most women in the dataset were born 150 years ago or less. The number of men also exploded quite recently, but men cover a longer period of time than women.


Reflection

The main struggle was the large number of categorical variables with many levels, like country or occupation. The information they held was interesting and fun to explore, but the sheer number of levels made it hard to analyze and compare. I really liked seeing how all of these notable people were distributed among the different categories: gender, occupation and location. There were also some oddities here and there, like actor Corbin Bleu having his bio translated into so many languages.

There is still so much you can do with this dataset. The people at Pantheon created a Golden Age variable that lets you know when a certain country had more notable people than other nations in the same period of time. It could also be nice to see how the percentage of notable people by profession and continent changes over time.

Overall, it is a cool dataset.